# added persistent caching for the uploaded files #398
## Conversation
### Pull request overview

This pull request implements persistent disk-based caching for uploaded files, replacing the in-memory LRU cache with a persistent FanoutCache from the diskcache library.

**Key Changes**

- **Persistent cache implementation:** Replaces `cachetools.LRUCache` with `diskcache.FanoutCache` to persist uploaded file metadata across application restarts
- **Cache configuration:** Cache stored at `~/.rapidata/upload_cache` with a 100MB size limit and sharding based on the upload worker count
- **Dependency updates:** Removed the `cachetools` dependency and added `diskcache ^5.6.3`
### Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/rapidata/rapidata_client/datapoints/_asset_uploader.py | Replaced in-memory LRU cache with persistent FanoutCache, updated cache access pattern from dictionary-style to .get() method |
| pyproject.toml | Replaced cachetools dependency with diskcache; added unrelated dev dependencies (ipykernel, jupyter) |
| poetry.lock | Updated lock file with new dependencies and transitive dependencies for diskcache, ipykernel, and jupyter |
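For context, a minimal sketch of the access-pattern change described in the first table row; the function and variable names here are illustrative, not taken from the PR:

```python
from typing import Optional

def lookup_uploaded_url(cache, asset_hash: str) -> Optional[str]:
    # Before (cachetools.LRUCache): dict-style access, KeyError on a miss:
    #     return cache[asset_hash] if asset_hash in cache else None
    # After (diskcache.FanoutCache): .get() returns None on a miss, and the
    # entry survives application restarts because it is stored on disk.
    return cache.get(asset_hash)
```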
```python
_shared_upload_cache: FanoutCache = FanoutCache(
    Path.home() / ".rapidata" / "upload_cache",
    shards=rapidata_config.upload.maxWorkers,
    timeout=0.01,
```
**Copilot (AI)** commented on Nov 21, 2025:
The timeout of 0.01 seconds (10 milliseconds) is extremely low for a disk cache operation. This could cause frequent timeout errors when multiple processes or threads try to access the cache simultaneously. Consider increasing this to at least 1 second (timeout=1.0) or using the default timeout value to avoid cache access failures under concurrent load.
Suggested change:

```diff
-    timeout=0.01,
+    timeout=1.0,
```
```python
_shared_upload_cache: FanoutCache = FanoutCache(
    Path.home() / ".rapidata" / "upload_cache",
    shards=rapidata_config.upload.maxWorkers,
    timeout=0.01,
    size_limit=100_000_000,  # 100MB
)
```
**Copilot (AI)** commented on Nov 21, 2025:
The FanoutCache is initialized at class definition time, which means it creates the cache directory immediately when the module is imported, even if AssetUploader is never used. This could cause issues if the .rapidata directory cannot be created (e.g., in read-only environments or CI/CD pipelines). Consider lazy initialization by moving the cache creation to __init__ or using a class method with @functools.cached_property to defer the cache creation until first use.
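A minimal sketch of the lazy-initialization idea, using a double-checked-locking classmethod rather than `cached_property`; the `_get_upload_cache` helper and the hardcoded shard count are illustrative assumptions, not code from the PR:

```python
import threading
from pathlib import Path
from typing import Optional

from diskcache import FanoutCache

class AssetUploader:
    _shared_upload_cache: Optional[FanoutCache] = None
    _cache_init_lock = threading.Lock()

    @classmethod
    def _get_upload_cache(cls) -> FanoutCache:
        # Defer cache (and directory) creation to first use, so merely
        # importing the module never touches the filesystem.
        if cls._shared_upload_cache is None:
            with cls._cache_init_lock:
                if cls._shared_upload_cache is None:
                    cls._shared_upload_cache = FanoutCache(
                        Path.home() / ".rapidata" / "upload_cache",
                        shards=4,  # the PR derives this from rapidata_config.upload.maxWorkers
                        timeout=1.0,
                        size_limit=100_000_000,  # 100 MB, as in the PR
                    )
        return cls._shared_upload_cache
```

Callers would then use `self._get_upload_cache()` instead of reading the class attribute directly, so the directory is only created once an upload actually happens.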
```toml
ipykernel = "^7.1.0"
jupyter = "^1.1.1"
```
**Copilot (AI)** commented on Nov 21, 2025:
The ipykernel and jupyter dependencies are added to the dev dependencies group, but they are unrelated to the persistent caching feature described in the PR title. These should be removed unless they are part of a separate change, or this PR should be split into two separate PRs to keep changes focused and easier to review.
Suggested change:

```diff
-ipykernel = "^7.1.0"
-jupyter = "^1.1.1"
```
### Pull Request Review: Persistent Caching for Uploaded Files

#### Summary

This PR adds persistent caching functionality for uploaded files using `diskcache.FanoutCache`.

#### 🟢 Strengths
#### 🔴 Critical Issues

**1. Race Condition in Config Update Handler** (`src/rapidata/rapidata_client/config/upload_config.py:44-46`)

```python
def __setattr__(self, name: str, value) -> None:
    super().__setattr__(name, value)
    self._notify_handlers()  # Called on EVERY attribute set
```

**Problem:** `_notify_handlers()` fires on every attribute assignment, including each field set while the object is still being constructed, so handlers can observe a partially initialized config.

**Recommendation:** Suppress notifications during initialization and notify once afterwards.
**Example fix:**

```python
def __init__(self, **kwargs):
    self._initializing = True
    super().__init__(**kwargs)
    self._initializing = False
    self._notify_handlers()

def __setattr__(self, name: str, value) -> None:
    super().__setattr__(name, value)
    if not getattr(self, '_initializing', False):
        self._notify_handlers()
```

**2. Thread Safety Concerns** (`src/rapidata/rapidata_client/datapoints/_asset_uploader.py:31-62`)

**Problem:** The shared upload cache is read and written by upload worker threads while config updates can recreate it concurrently, with no synchronization between the two.
**Recommendation:** Add a lock around cache operations and config updates:

```python
import threading

class AssetUploader:
    _cache_lock = threading.RLock()
    _shared_upload_cache: FanoutCache = ...

    def upload_asset(self, asset: str) -> str:
        with self._cache_lock:
            ...  # cache operations

    @classmethod
    def _handle_config_update(cls, config):
        with cls._cache_lock:
            ...  # cache recreation
```

**3. Potential Memory Leak with Handler Registration** (`src/rapidata/rapidata_client/datapoints/_asset_uploader.py:29`)

```python
def __init__(self, openapi_service: OpenAPIService):
    self.openapi_service = openapi_service
    register_upload_config_handler(self._handle_config_update)  # Leaks!
```

**Problem:** Handlers are registered but never unregistered. Each `AssetUploader` instance adds another handler to the registry, so handlers (and the uploaders they reference) accumulate for the lifetime of the process.

**Recommendation:** Register once at class level, or hold handlers as weak references.
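One standard way to do the weak-reference variant (a sketch, not necessarily the fix the reviewer had in mind; the module-level registry mirrors the one shown in issue 4 below):

```python
import weakref

# Assumed module-level registry of handlers, as described in this review.
_upload_config_handlers: list = []

def register_upload_config_handler(handler) -> None:
    # WeakMethod stores the bound method without keeping its instance alive,
    # so a garbage-collected AssetUploader no longer pins a handler.
    _upload_config_handlers.append(weakref.WeakMethod(handler))

def _notify_handlers(config) -> None:
    for ref in list(_upload_config_handlers):
        handler = ref()
        if handler is None:
            _upload_config_handlers.remove(ref)  # owner was collected; prune
        else:
            handler(config)
```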
#### 🟡 Important Issues

**4. Silent Failures in Handler Notification** (`src/rapidata/rapidata_client/config/upload_config.py:50-54`)

```python
for handler in _upload_config_handlers:
    try:
        handler(self)
    except Exception as e:
        print(f"Warning: UploadConfig handler failed: {e}")  # Only prints!
```

**Problem:** Exceptions are caught and only printed, not logged. This could hide serious issues.

**Recommendation:** Use the logger:

```python
from rapidata.rapidata_client.config import logger

def _notify_handlers(self) -> None:
    for handler in _upload_config_handlers:
        try:
            handler(self)
        except Exception as e:
            logger.warning(f"UploadConfig handler failed: {e}", exc_info=True)
```

**5. No Cache Directory Creation** (`src/rapidata/rapidata_client/datapoints/_asset_uploader.py:19-24`)

**Problem:** The cache directory is never created explicitly before the cache is (re)created, so initialization can fail if the configured location does not yet exist.

**Recommendation:** Add directory creation:

```python
@classmethod
def _handle_config_update(cls, config):
    logger.debug("Updating AssetUploader shared upload cache with new config")
    try:
        # Ensure directory exists
        config.cacheLocation.mkdir(parents=True, exist_ok=True)
        # ... rest of the method
```

**6. No Cache Eviction Strategy Documentation**

**Problem:** The PR sets a 100MB `size_limit` but does not document which eviction policy applies once that limit is reached.
**Recommendation:** Document the eviction policy in the docstring (note that diskcache defaults to least-recently-stored eviction, not LRU; LRU is available via the `eviction_policy` setting).

#### 🟡 Code Quality Issues

**7. Inconsistent Naming Conventions**

The config uses camelCase field names (e.g. `maxWorkers`, `cacheSizeLimit`), whereas Python convention is snake_case.

**8. Missing Type Hints** (`src/rapidata/rapidata_client/config/upload_config.py:40-46`)

```python
def __setattr__(self, name: str, value) -> None:  # 'value' lacks type hint
```

**Recommendation:** Add a type annotation for `value` (e.g. `value: Any`).

**9. Magic Numbers** (`src/rapidata/rapidata_client/config/upload_config.py:38`)

```python
cacheSizeLimit: int = Field(default=100_000_000)  # 100MB
```

**Recommendation:** Define as a constant with documentation:

```python
DEFAULT_CACHE_SIZE_MB = 100
cacheSizeLimit: int = Field(default=DEFAULT_CACHE_SIZE_MB * 1024 * 1024)
```

#### 🔵 Testing Concerns

**10. No Tests for New Functionality**

**Critical Gap:** The PR adds significant new functionality but includes no tests.
**Recommendation:** Add tests covering the new caching behavior: persistence across restarts, config-driven cache recreation, and concurrent access (a starting sketch follows at the end of this review).

#### 🟢 Performance Considerations

**11. Positive:** Using `FanoutCache` with shards matched to the upload worker count should reduce write contention under concurrent uploads.
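As a concrete starting point for the missing tests, here is a pytest sketch (the test name and structure are assumptions) that checks the core property this PR relies on, persistence across a simulated restart:

```python
from diskcache import FanoutCache

def test_upload_cache_persists_across_restart(tmp_path):
    cache_dir = tmp_path / "upload_cache"

    # First "run": record an uploaded file, then close the cache.
    cache = FanoutCache(cache_dir, shards=4, size_limit=100_000_000)
    cache["asset-hash-abc"] = "https://example.com/files/abc"
    cache.close()

    # Second "run": re-open the same directory and expect the entry back.
    reopened = FanoutCache(cache_dir, shards=4, size_limit=100_000_000)
    try:
        assert reopened.get("asset-hash-abc") == "https://example.com/files/abc"
    finally:
        reopened.close()
```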
@claude do these poetry.lock changes seem right?